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Reconstructing propagation networks with natural 
diversity and identifying hidden sources 
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Our ability to uncover complex network structure and dynamics from data is fundamental to 
understanding and controlling collective dynamics in complex systems. Despite recent 
progress in this area, reconstructing networks with stochastic dynamical processes from 
limited time series remains to be an outstanding problem. Here we develop a framework 
based on compressed sensing to reconstruct complex networks on which stochastic 
spreading dynamics take place. We apply the methodology to a large number of model and 
real networks, finding that a full reconstruction of inhomogeneous interactions can be 
achieved from small amounts of polarized (binary) data, a virtue of compressed sensing. 
Further, we demonstrate that a hidden source that triggers the spreading process but is 
externally inaccessible can be ascertained and located with high confidence in the absence of 
direct routes of propagation from it. Our approach thus establishes a paradigm for tracing and 
controlling epidemic invasion and information diffusion in complex networked systems. 
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One of the outstanding problems in interdisciplinary 
science is nonlinear and complex systems identification, 
prediction and control. Given a complex dynamical 
system, the various types of dynamical processes are of great 
interest. The ultimate goal in the study of complex systems is to 
devise practically implementable strategies to control the 
collective dynamics. A great challenge is that the network 
structure and the nodal dynamics are often unknown but only 
limited measured time series are available. To control the system 
dynamics, it is imperative to be able to map out the system details 
from data. Reconstructing complex network structure and 
dynamics from data, the inverse problem, has thus become a 
central issue in contemporary network science and engineering^"^. 
There are broad applications of the solutions of the network 
reconstruction problem, due to the ubiquity of complex interacting 
patterns arising from many systems in a variety of disciplines^" 

An important class of collective dynamics is epidemic 
spreading and information diffusion in the human society or 
on computer networks The past decades have witnessed 
severe epidemic outbreaks at the global scale due to the mutation 
of virus, including SARS^^'^^, H5N1 (refs 23,24), HlNl 
(refs 25,26) and the recent invasion of H7N9 in eastern 
China^^'^^. Our goal is to reconstruct the networks hosting the 
spreading process and identify the source of spreading using 
limited measurements. This is especially challenging due to (1) 
difficulty in predicting and monitoring mutations of deadly virus 
and (2) absence of epidemic threshold in heterogeneous 
networks^^"^^. Another example is rumour propagation in the 
online virtual communities, which can cause financial loss or even 
social instabilities, such as the 2011 irrational and panicked 
acquisition of salt in southeast Asian countries caused by the 
nuclear leak in Japan. In this regard, identifying the propagation 
network for controlling the dynamics is of great interest. Another 
significant challenge in reconstructing a spreading network lies in 
the nature of the available time series: they are polarized, despite 
stochastic spreading among nodes. Indeed, the link pattern and 
the probability of infection are encrypted in the binary status of 
individuals, infected or not, analogous to the collapse of wave 
function to one associated with some discrete quantum state 
induced by observation in quantum mechanics. 

There have been recent efforts in addressing the inverse 
problem of some special types of complex propagation net- 
works^^'^"*. In particular, for diffusion process originated from a 
single source, the routes of diffusion from the source constitute a 
tree-like structure. If information about the early stage of the 
spreading dynamics is available, it would be feasible to decode all 
branches that reveal the connections from the source to its 
neighbours, and then to their neighbours and so on. Taking into 
account the time delays in the diffusion process enables a 
straightforward inference of the source in a complex network 
through enumerating all possible hierarchical trees^ However, 
if no immediate information about the diffusion is available, 
the tree-structure-based inference method is inappHcable, and the 
problem of network reconstruction and locating the source 
becomes extremely challenging, hindering control of diffusion 
and delivery of immunization. The loss of knowledge about the 
source is common in real situations. For example, passengers on 
an international flight can carry a highly contagious disease, 
making certain airports the immediate neighbours of the hidden 
source, which would be difficult to trace. In another example, the 
source could be migratory birds coming from other countries or 
continents. A general data-driven approach, applicable in such 
scenarios, is still lacking. 

In this paper, we develop a general theoretical framework to 
reconstruct complex propagation networks from time series 
based on the compressed sensing theory (CST)^^-^^^ a novel 
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Optimization paradigm for sparse-signal reconstruction with 
broad applications in signal and image processing. Owing to 
the striking characteristics of CST such as the extremely low data 
requirement and rigorous guarantee of convergence to optimal 
solutions, our framework is highly efficient and accurate. 
However, casting the inverse problem into the CST framework 
is highly non-trivial. Although CST has been used to uncover the 
nodal interaction patterns for coupled oscillator networks or 
evolutionary games from time series'* the dynamics of 
epidemic propagation is typically highly stochastic with, for 
example, binary time series, rendering inapplicable the existing 
CST-based formulation. Further, despite the alternative sparsity 
enforcing regularizers and convex optimization used in ref. 44 to 
infer networks, CST has not been applied to reconstructing 
propagation networks, especially when the available time series 
are binary. The main accomplishment of this work is then the 
development of a scheme to implement the highly non-trivial 
transformation associated with the spreading dynamics in the 
paradigm of CST. Without loss of generality, we employ two 
prototypical models of epidemic spreading: classic susceptible- 
infected- susceptible (SIS) dynamics*^ and contact processes 
(CPs)"*^'"*^, on both model and real-world (empirical) networks. 
Inhomogeneous infection and recovery rates as representative 
characteristics of the natural diversity are incorporated into the 
diffusion dynamics to better mimic the real-world situation. We 
assume that only binary time series can be measured, which 
characterize the status of any node, infected or susceptible, at any 
time after the outbreak of the epidemic. The source that triggers 
the spreading process is assumed to be externally inaccessible 
(hidden). In fact, one may not even realize its existence from 
available time series. Our method enables, based on relatively 
small amounts of data, a full reconstruction of the epidemic 
spreading network with nodal diversity and successful 
identification of the immediate neighbouring nodes of the 
hidden source (thereby ascertaining its existence and uniquely 
specifying its connections to nodes in the network). The 
framework is validated with respect to different amounts of 
data generated from various combinations of the network 
structures and dynamical processes. High accuracy, high 
efficiency and applicability in a strongly stochastic environment 
with measurement noise and missing information are the 
most striking characteristics of our framework. Thus, broad 
applications can be expected in addressing significant problems 
such as targeted control of disease and rumour spreading. 

Results 

Compressed sensing. The general problem that CST addresses is 
to reconstruct a vector X e from linear measurements Y about 
X in the form 

Y^O-X, (1) 

where Y e and O is an M x N matrix. The striking feature of 
CS is that the number of measurements can be much less than the 
number of components of the unknown vector, that is, M«N, 
insofar as X is sparse and the number of non-zero components in 
it is less than M. Accurate reconstruction can be achieved by 
solving the following convex-optimization problem^^: 

min 1 1 X I |i subject to Y = O • X, (2) 

where || X ||i= X^jli | Xj | is the Li norm of X and the matrix O 
satisfies the restricted isometry property. Solutions to the convex 
optimization are now standard^^""*^. (More details of the CST can 
be found in Supplementary Note 1.) Our goal is to develop a 
framework to cast the problem of reconstructing propagation 
networks into form (1). 
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Reconstruction framework. To present our framework in a 
transparent manner, we first consider the relatively simple case 
where there is no hidden source. Further, we assume that the 
disease starts to propagate from a fraction of the infected nodes. 
As we will see, based on this framework, it is feasible to locate any 
hidden source based solely on time series after outbreak of 
infection. The state of an arbitrary node / is denoted as S/, where 

_ f 0, susceptible; . . 

' ~ \ 1, infected. ^ ^ 

Owing to the characteristic difference between the SIS dynamics 
and CP, we treat them separately (see Methods). 

For the SIS dynamics, the probability P^^{t) of an arbitrary 
node / being infected by its neighbors at time t is 

p^\t) = i-{i-i,y-^^^^' , (4) 

where is the infection rate of /, stands for the elements of the 
adjacency matrix (a^- = 1 if / connects to j and = 0 otherwise), 
Sj{t) is the state of node; at U and the superscript 01 denotes the 
change from susceptible state (0) to infected state (1). At the same 
time, the recovery probability of / is Pj'^^t) = Si, where 5/ is the 
recovery rate of node / and the superscript 10 denotes the 



transition from infected state to susceptible state. Equation (4) 
can be rewritten as 

If measurements at different times t= ti, ■ ■ • > are available, 
equation (5) can be written in the matrix form x i 
= ^mx(N-i)-X(A^-i)xi, where Y contains - P^^ {t)] at 
different U ^ is determined by the state Sj{t) of nodes except /, 
and X comprising the links and infection rates of / is sparse for a 
general network (see Methods). The main challenge here is that 
the infection probabilities Pf^(t) at different times are not given 
directly from the time series of the nodal state. 

To develop a method to estimate the probability from the 
nodal states, we set a threshold A pertaining to the normalized 
Hamming distance between strings composed of Sj{t) (j^i) at 
different t to identif)^ a base string at and a set of strings subject 
to the base. According to the law of large numbers, the probability 
P^^{ia) can be estimated by the average over the state S/(t+ 1) at 
all proper time. By setting another threshold 0 associated with 
the normalized Hamming distance, we can identif)^ a set of base 
strings. This process finally gives rise to a set of reconstruction 
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Figure 1 | Schematic illustration of building up Y and O from binary time series, (a) Fourteen snapshots of data at time instants — of eight 
nodes in a sannple network, where S,- is the time series of node 2 and S_ / denotes the strings of other nodes at different times. The neighbourhood of 
node 2 is to be reconstructed. Only the pairs 00 and 01 in the time series of S, (/ = 2) and the corresponding S_/ contain useful information about 
the network, as marked by different colours, (b) As S/(f + 1) is determined by the neighbours of / and S_/(0, we sort out S/(t + 1) and S_/(0 in the 
coloured sections of the time series in a. According to the threshold parameters A =3/7 and 0 = 3/7, we calculate the normalized Hamming 
distance between each pair of strings S_/(0, finding two base strings at ?i = U and f2 = tn with H[S_/(Ji), S_/(t2)] >0. We separate the coloured 
strings into two groups that are led by the two base strings, respectively. In each group, the normalized Hamming distance H[S_/(ta), S_/(fv)] between 
the base string and other strings is calculated and the difference from S_/(ta) in each string is marked by red. Using parameter A, in the group led by 
S_/(ti), 5_iits) and S/Cte) are preserved, because of H[S_/(ti), S_/(t5)] < A. In contrast, S_j(tj) and S/Cts) si's disregarded because /-/[S_/(ti), S_/(t7)] > A. 
In the group led by S_/(f2), due to /-/[S_/(?2), S_/(fi3)] < A, the string is preserved. The two sets of remaining strings marked by purple and green can 
be used to yield the quantities required by the reconstruction formula. (Note that different base strings are allowed to share some strings, but for simplicity, 
this situation is not illustrated here. See Supplementary Note 3 for a detailed discussion.) (c) The average values (S/(ta + 1)) and (S_/(ta)) used to extract 
the vector Yand the matrix ^ in the reconstruction formula, where (S_/(ti)) = [S_/(ti) +S_/(t5)]/2, (S_/(t2)) = [S_/(fii) +S_/(fi3)]/2, (S/(ti+1)) = 
[5,(^2) +5/(^6 )]/2 and (S/(f2+1)) = [5,(^12) +5/ (fi4)]/2 based on the remaining strings marked in different colours (see Methods for more details). 
CSTcan be used to reconstruct the neighbouring vector X of node 2 from Yand ^ from Y = ^-X. 
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equations in the matrix form: 
■ln[l-(S,ft + l))]" 
\n[l -{Siih + m 

.ln[l-(Si(?„ + l))]. 

(Si(fi)) ••• (Si-iCh)) (S;+i(fi)) 
{Siih)) ■■■ {Si-iih)) (Si+iih)) 



(SiiD) ■■■ {Si-iiD) {Si+iiim)) 
ln(l — Xi)aii 



(SnOi)) 

{SNih)) 
{SN{tm)) 



ln(l -2/)a/,/_i 
ln(l -2/)a/,/+i 

(6) 

where ii^hr ' ' ^im correspond to the time associated with m base 
strings and < • > denote the average over all satisfied t (see 
Methods). The vector x i and the matrix ^ 



m X {N — 1) 



can then 



be obtained based solely on time series of nodal states and the 
vector X(^^_i)xi to be reconstructed is sparse, rendering 
applicable the CS framework. As a result, we can achieve exact 



reconstruction of all neighbours of node / from relatively small 
amounts of observation. In a similar manner the neighbouring 
vectors of all other nodes can be uncovered from time series, 
enabling a full reconstruction of the whole network by matching 
the neighboring sets of all nodes. 

For the CP dynamics, the infection probability of an arbitrary 
node / is given by 



(7) 



where ki is the degree of the node /, and the recovery probability is 
Pj^{t) = Si (see Methods). In close analogy to the SIS dynamics, 
we have 



(S,{l + l)) (Pf'(i)) = 



k, 



(8) 



We then choose a series of base strings using a proper threshold 
0 to establish a set of equations, expressed in the matrix form 
Ymxi = ^mx(N-i)-X(7v-i)xi (see Supplementary Note 2), 
where O has the same form as in equation (6), but Y and X 
are given by 



Y = [{Si{ii + 1)), {Siih + 1)), • • • , {SiOm + 



X = 



-a 



(9) 



Our reconstruction framework based on establishing the vector Y 
and the matrix O is schematically illustrated in Fig. 1. It is 
noteworthy that our framework can be extended to directed 
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Figure 2 | Network reconstruction performance, (a) Element values ln(1 — 'k^Oij of vector X times -1 for different fraction of base strings for 
SIS dynamics. (b,c) Success rate (SREL and SRNC) and conflict rate (CR) of reconstruction as a function of for SIS dynamics on Newman-Watts (NW) 
small-world networks (b) and CP dynamics on Erdos-Renyi (ER) random networks (c). For the SIS dynamics, the parameters are 0 = 0.25, A =0.45 and 
the infection and recovery rates X-, and d-, are randomly distributed in the ranges (0.2, 0.4) and (0.4, 0.6), respectively. For the CP dynamics, the parameters 
are 0 = 0.35, A = 0.45 and and d; are randomly distributed in the ranges (0.7, 0.9) and (0.2, 0.4), respectively. The network size N is 200 with average 
node degree </c> =4. The results are obtained by ensemble averaging over ten independent realizations. The success rate is determined by setting a 
cut-off according to Supplementary Fig. la and the method described in Supplementary Note 4. 
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networks in a straightforward manner due to the feature that the 
neighbouring set of each node can be independently recon- 
structed. For instance, the neighbouring vector X can be defined 
to represent a unique link direction, for example, incoming links. 
Inference of the directed links of all nodes yields the full topology 
of the entire directed network. 

Reconstructing networks and infection and recovery rates. To 

quantify the performance of our method in terms of the number 
of base strings (equations) for a variety of diffusion dynamics and 
network structures, we study the success rates for existent links 
(SRELs) and null connections (SRNCs), corresponding to non- 
zero and zero element values in the adjacency matrix, respec- 
tively. We impose the strict criterion that the network is regarded 
to have been fully reconstructed if and only if both success rates 
reach 100%. The sparsity of links makes it necessary to define 
SREL and SRNC separately. As the reconstruction method is 
implemented for each node in the network, we define SREL and 
SRNC on the basis of each individual node and the two success 
rates for the entire network are the respective averaged values 
over all nodes. We also consider the issue of trade-off in terms of 
the true positive rate (for correctly inferred links) and the false 
positive rate (for incorrectly inferred links). 

Here we assume that there is no hidden source and the 
spreading process starts from a fraction of infected nodes, and 
record the binary time series. Figure 2a shows the reconstructed 
values of the components of the neighbouring vector X of all 
nodes. Let be the number of base strings normalized by the 
network size N. For small values of for example, = 0.1, the 
values of elements associated with links and that associated with 
null connections (actual zeros in the adjacency matrix) overlap, 
leading to ambiguities in the identification of links. In contrast, 
for larger values of for example, — 0.4, an apparent gap 
emerges between the two groups of element values, enabling us to 
correctly identif)^ all links by simply setting a cut-off within the 
gap (see Supplementary Fig. la and Supplementary Note 4 for a 
method to set the cut-off). The success rates (SREL and SRNC) as 
a function of for SIS and CP on both homogeneous and 
heterogeneous networks are shown in Fig. 2b,c, where we observe 
nearly perfect reconstruction of links insofar as rij exceeds a 
relatively small value— an advantage of compressed sensing. The 
exact reconstruction is robust in the sense that a wide range of rij 
values can yield nearly 100% success rates. Our reconstruction 
method is then effective for tackling real networks in the absence 
of any a priori knowledge about its topology. In particular, the 
existence of a clear gap in the reconstructed vector X represents a 
successful reconstruction for a real network. 

Note that a network is reconstructed through the union of all 
neighbourhoods, which may encounter 'conflicts' with respect to 
the presence/ absence of a link between two nodes as generated by 
reconstruction centred at the two nodes. The conflicts would 
reduce the accuracy in the reconstruction of the entire network. 
To characterize the effects of edge conflicts, we study the 
consistency of mutual assessment of the presence or absence of 
link between each pair of nodes, as shown in Fig. 2b,c. We see 
that inconsistency arises for small values of rij but vanishes 
completely when the success rates reach 100%, indicating 
complete consistency among the mutual inferences of nodes 
and consequently guaranteeing accurate reconstruction 
of the entire network. Detailed results of success rates and 
trade-off measures with respect to a variety of model and real 
networks are displayed in Table 1, Supplementary Figs 2 and 3. 
and Supplementary Note 5. 

Although the number of base strings is relatively small 
compared with the network size, we need a set of strings at 
different time with respect to a base string to formulate the 



mathematical framework for reconstruction. We study how the 
length of time series affects the accuracy of reconstruction. 
Figure 3a,b show the success rate as a function of the relative 
length Hf of time series for SIS and CP dynamics on both 
homogeneous and heterogeneous networks, where rit is the total 
length of time series from the beginning of the spreading process 
divided by the network size N. The results demonstrate that even 
for very small values of n^, most links can already be identified, as 
reflected by the high values of the success rate shown. Figure 3c,d 
show the minimum length n^^^ required to achieve at least 95% 
success rate for different network sizes. For both SIS and CP 
dynamics on different networks, n^^^ decreases considerably as 
N is increased. This seemingly counterintuitive result is due to the 
fact that different base strings can share strings at different times 
to enable reconstruction. In general, as N is increased, will 
increase accordingly. However, a particular string can belong to 
different base strings with respect to the threshold A, accounting 
for the slight increase in the absolute length of the time series (see 
Supplementary Fig. 4 and Supplementary Note 5) and the 
reduction in n^^^ (see Supplementary Note 3 on the method to 
choose base and subordinate strings). The dependence of the 
success rate on the average node degree < A; > for SIS and CP on 
different networks has been investigated as well (see 
Supplementary Fig. 5 and Supplementary Note 5). The results 
in Figs 2 and 3, Supplementary Figs 2-5 and Table 1 demonstrate 
the high accuracy and efficiency of our reconstruction method 
based on small amounts of data. 

In practice, noise is present and it is also common for time 
series from certain nodes to be missing, and it is necessary to test 
the applicability of our method in more realistic situations. 
Figure 4a,b show the dependence of the success rate on the 
fraction rif of states in the time series that flip due to noise for SIS 
and CP dynamics on two types of networks. We observe that the 
success rates are hardly affected, providing strong evidence for the 
applicability of our reconstruction method. For example, even 
when 25% of the nodal states flip, we can still achieve about 80% 
success rates for both dynamical processes and different network 
topologies. Figure 4c,d present the success rate versus the fraction 
n^i of unobservable nodes, the states of which are externally 
inaccessible. We find that the high success rate remains mostly 
unchanged as is increased from 0 to 25%, a somewhat 
counterintuitive but striking result. The high degree of robustness 
against the limit to accessing nodal states is elaborated further in 
Supplementary Fig. 6 and Supplementary Note 5. We find that, in 
general, missing information can affect the reconstruction of the 
neighbouring vector, as reflected by the reduction of the gap 
between the reconstructed values associated with actual links and 
null connections. However, even for high values of n^, for 
example, nm = 0.3, there is still a clear gap, indicating that a full 
recovery of all links is achievable. We have also found that 
our method is robust against inaccurately specified diffusion 
processes with fluctuation in infection rates (see Supplementary 
Fig. 7 and Supplementary Note 5). Taken together, the high 
accuracy, efficiency and robustness against noise, missing 
information and inaccurately modelling of real dynamical 
processes provide strong credence for the validity and power of 
our framework for binary time-series-based network 
reconstruction. 

Having reconstructed the network structure, we can estimate 
the infection and recovery rates of individuals to uncover their 
diversity in immunity. This is an essential step to implement 
target vaccination strategy in a population or on a computer 
network to effectively suppress/prevent the spreading of virus at 
low cost, as a large body of literature indicates that knowledge 
about the network structure and individual characteristics is 
sufficient for controlling the spreading dynamics"*^"^^. Here we 
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Table 1 | Performances of reconstruction and locating hidden source. 
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WS 


1.0 


1.0 


1.0 


0.0 


0.009 


0.0 


0.048 


0.936 


0.068 


ER 


0.999 


1.0 


1.0 


0.0 


0.01 


0.0 


0.073 


0.925 


0.327 


BA 


0.997 


1.0 


1.0 


0.0 


0.008 


0.0 


0.043 


0.943 


0.08 


Prison 


0.995 


0.996 


0.996 


0.004 


0.005 


0.0 


0.018 


0.911 


0.012 


Santafe 


0.984 


0.996 


0.996 


0.004 


0.006 


0.0 


0.036 


0.929 


0.034 


Netscience 


0.996 


0.999 


0.996 


0.001 


0.007 


0.0 


0.166 


1.0 


0.050 


NW 


1.0 


1.0 


1.0 


0.0 


0.009 


0.0 


0.052 


0.98 


0.034 


ZK 


0.992 


0.992 


0.992 


0.008 


0.007 


0.001 


0.022 


0.977 


0.028 


Polbooks 


0.973 


0.995 


0.973 


0.005 


0.008 


0.0 


0.042 


0.829 


0.386 


Football 


0.995 


0.997 


0.995 


0.003 


0.006 


0.0 


0.028 


0.517 


0.015 


Dolphin 


0.952 


0.971 


0.971 


0.029 


0.006 


0.0 


0.026 


0.786 


0.059 


Leadership 


0.996 


0.993 


0.993 


0.007 


0.006 


0.0 


0.013 


0.857 


0.075 



FPR, false positive rate; LHS, locating hidden source; SREL, success rates for existent link; SRNC, success rates for null connection; TPR, true positive rate. 

The accuracy of network reconstruction is quantified by the success rates SREL and SRNC, as well as the trade-off measures TPR and FPR. The accuracy in determining the values of the infection rate X is 
characterized by the relative mean errors, the minimum and maximum errors. The accuracy of LHS is characterized by the trade-off measures TPR and FPR. The results of network reconstruction and 
error in A are obtained from 30 independent realizations. The results of LHS is obtained from ten dynamical realizations and ten configurations of the hidden source. Other parameters are the same as in 
Fig. 2. For data sources, reference and network models, see Supplementary Table 1 and Supplementary Note 10. 



offer an effective method to infer the individuals' infection rates 
based solely on the binary time series of the nodal states after an 
outbreak of contamination. (To our knowledge, there was no 
prior work addressing this critical issue.) In particular, after all 
links have been successfully predicted, li can be deduced from 
the infection probabilities that can be approximated by the 
corresponding infection frequencies (see Methods). These 
probabilities depend on both and the number of infected 
neighbours. The reproduced infection rates of individuals for 
both SIS and CP dynamics on different networks are in quite 
good agreement with the true values with small prediction errors 
(see Supplementary Fig. 8 and Supplementary Note 6). Results 
from a comprehensive error analysis are listed in Table 1, where 
the uniformly high accuracy validates our method. The 
inhomogeneous recovery rates Si of nodes can be predicted 
from the binary time series in a more straightforward way, 
because Si do not depend on the nodal connections (see 
Supplementary Fig. 9 and Supplementary Note 6). Thus, our 
framework is capable of predicting characteristics of nodal 
diversity in terms of degrees, and infection and recovery rates 
based solely on binary time series of nodal states. 

Locating the hidden source of propagation. We assume that a 
hidden source exists outside the network but there are connec- 
tions between it and some nodes in the network. In practice, the 
source can be modelled as a special node that is always infected. 
Starting from the neighbourhood of the source, the infection 
originates from the source and spreads all over the network. We 
collect a set of time series of the nodal states, except the hidden 
source (see Methods). The basic idea of ascertaining and locating 
the hidden source is based on missing information from the 

6 



hidden source when attempting to reconstruct the network. In 
particular, to reconstruct the connections belonging to the 
immediate neighbourhood of the source accurately, time series 
from the source are needed to generate the matrix O and the 
vector Y. However, as the source is hidden, no time series from 
it are available, leading to reconstruction inaccuracy and, 
consequently, anomalies in the predicted link patterns of the 
neighbouring nodes. It is then possible to detect the neighbour- 
hood of the hidden source by identifying any abnormal connec- 
tion patterns^ ^ which can be accomplished by using different 
data segments. If the inferred links of a node are stable with 
respect to different data segments, the node can be deemed to 
have no connection with the hidden source; otherwise, if the 
result of inferring a node's links varies significantly with respect 
to different data segments, the node is likely to be connected to 
the hidden source. The s.d. of the predicted results with respect to 
different data segments can be used as a quantitative criterion 
for the anomaly. Once the neighbouring set of the source is 
determined, the source is then precisely located topologically. 

Figure 5 presents an example, where a hidden source is 
connected with four nodes in the network (Fig. 5a), as reflected in 
the network adjacency matrix (Fig. 5b). We implement our 
reconstruction framework on each accessible node by using 
different sets of data in the time series. For each data set, we 
predict the neighbours of all nodes and generate an adjacency 
matrix. Averaging over the elements corresponding to each 
location in all the reconstructed adjacency matrices, we obtain 
Fig. 5c, in which each row corresponds to the mean number of 
links in a node's neighbourhood. The inferred links of the 
immediate neighbours of the hidden source exhibit anomalies. 
To quantif)^ the anomalies, we calculate the structural s.d. a from 
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Figure 3 | Effect of the length of time series and network size. (a,b) Success rate as a function of the relative length rif of time series for SIS (a) and CP 

(b) in combination with Newman-Watts (NW) and Barabasi-Albert (BA) networks. The dashed lines represent 95% success rate. (c,d) the minimum 
relative length n™" that assures at least 95% success rate as a function of the network size N for SIS (c) and CP (d) dynamics on NW and BA networks. 
Here the success rate is the geometric average over SREL and SRNC. For SIS dynamics, Ag(0.1, 0.3) and 3e(0.2, 0.4). For CP dynamics, Ag(0.7, 0.9) and 
3 G (0.3, 0.5). In a and b, the network size N is 500. The other parameters are the same as in Fig. 2. Note that nt=t/N, where t is the absolute length of time 
series, and n™" = tm\n/N, where tmin is the minimum absolute length of time series required for at least 95% success rate. 
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Figure 4 | Reconstruction against noise and inaccessible nodes. (a,b) Success rate as a function of the fraction nf of the flipped states induced by noise in 
time series for SIS (a) and CP (b) dynamics on Erdos-Renyi (ER) and Barabasi-Albert (BA) networks. (c,d) Success rate as a function of the fraction of 
nodes that are externally inaccessible for SIS (c) and CP (d) dynamics on NW and BA networks. The network size N is 500 and <^> =4. The other 
parameters are the same as in Fig. 2. 
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Figure 5 | Locating an external hidden source from time series, (a) Hidden source treated as a special node (in red) is connected to four nodes in the 
network (blue). The time series of other nodes except the source (number 50) after the outbreak of an epidemic are assumed to be available, (b) True 
adjacency matrix of the NW network with identical link weights to facilitate a comparison with the reconstructed adjacency matrix, (c) Reconstructed 
adjacency matrix from a number of segments in time series. The four neighbouring nodes of the source are predicted to be densely linked to other nodes, as 
indicated by the average value of the elements in the four rows corresponding to these nodes, (d) Structural variance o of each node. The four neighbouring 
nodes of the source exhibit much larger values of a than those from the other nodes, providing unequivocal evidence that they belong to the immediate 
neighbourhood of the hidden source. 



different data segments, where a associated with node / is defined 
through the /th row in the adjacency matrix as 



(^i — 



1 ^ 



(10) 



where j denotes the column, a|^^^ represents the element value in 
the adjacency matrix inferred from the /cth group of the data, 
{uij) — {\/g) X^f^i is the mean value of and ^ is the number 
of data segments. Applying equation (10) to the reconstructed 
adjacency matrices gives the results in Fig. 5d, where the values of 
G associated with the immediate neighbouring nodes of the 
hidden source are much larger than those from others (which are 
essentially zero). A cut-off value can be set in the distribution of 
Gi to identif)^ the immediate neighbours of the hidden source 
(see Supplementary Fig. lb and Supplementary Note 4). The 
performance of locating hidden source by means of the trade-off 
measures (true positive rate versus false positive rate) are 
displayed in Table 1. 

Discussion 

We have developed a general framework to reconstruct complex 
propagation networks on which epidemic spreading takes place 
from binary time series. Our paradigm is based on compressed 
sensing, is completely data-driven and practically significant for 
controlling epidemic spreading through targeted vaccination. 
Both theoretically and practically, our framework can be used to 
address the extremely challenging problem of reconstructing the 
intrinsic interacting patterns of complex stochastic systems based 
on small amounts of polarized time series. The key to success 
of our method lies in our development of a novel class of 
transformation, allowing the network inference problem to be 
converted to the problem of sparse signal reconstruction, which 
can then be solved by the standard compressed- sensing 
algorithm. The accuracy and efficiency of our framework in 
uncovering the network structure, the natural diversity in the 
nodal characteristics, and any hidden source are guaranteed by 
the CST with rigorous proof for low-data requirement and 
convergence to optimal solution. The feasibility of our framework 
has been demonstrated using a large number of combinations of 
epidemic processes and network structures, where in all cases 
highly accurate reconstruction is achieved. Our approach opens 
up a new avenue towards fully addressing the inverse problem in 
complex stochastic systems in a highly efficient manner, a 
fundamental stepping stone towards understanding and control- 
ling complex dynamical systems in general. 



We have focused on two types of spreading dynamics, SIS and 
CP, where an infected individual can recover and becomes 
susceptible again. In this regard, even if an outbreak occurs, 
control strategy such as targeted vaccination or quarantine can be 
helpful to eliminate the virus eventually. A main purpose of our 
work is to identify the key individuals in the network to 
implement target control and to locate the source of infection to 
isolate it so as to prevent recurrent infection in the future. 
Although for any spreading dynamics, the most effective way to 
prevent a large-scale outbreak is to implement control during the 
early stage, this may be impractical in many situations. If we miss 
the early stage, which is possible especially in complex networks 
where the epidemic threshold can be near zero, to be able to 
reconstruct the spreading network is of tremendous value. Besides 
disease spreading, our framework is applicable to rumour or 
information spreading. In this case, identif)^ing the source of 
rumour is important, a problem that our framework is capable of 
solving. 

Our work raises a number of questions to further and perfect 
the theoretical and algorithmic development of reconstructing 
complex dynamical systems. For example, if partial knowledge 
about the network structure is available, the information can be 
incorporated into our framework to further reduce the required 
data amount. Moreover, for non-Markovian spreading processes, 
our current reconstruction framework may fail. This raises the 
need to develop new and more general approaches. Nevertheless, 
our theory, due to its generality and applicability to various types 
of inhomogeneous interactions, can be applied to networks of 
networks or interdependent networks, in which there may be 
different spreading patterns associated with distinct layers or 
components. Taken together, our results provide strong credence 
to the proposition that complex networks can be fully decrypted 
from measurements, even when stochastic disturbance and 
hidden sources are present. This can offer a deeper understanding 
of complex systems in general and significantly enhance our 
ability to control them based on, for example, the recently 
developed controllability theory of complex networks^^"^^. 



Methods 

Spreading processes. The SIS model is a classic epidemic model that has been 
used frequently to study a variety of spreading behaviours in social and computer 
networks. Each node of the network represents an individual and links are con- 
nections along which infection can propagate to others with certain probability. 
At each time step, a susceptible node i in state 0 is infected with rate if it is 
connected to an infected node in state 1. If z connects to more than one infected 
neighbour, the infection probability P^^ is given by equation (4). At the same time, 
infected nodes are continuously recovered to be susceptible at the rates d-^. The CP 
model has been used extensively to describe, for example, the spreading of infection 
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and competition of animals over a territory, where 1; is determined by equation (7). 
The main difference between SIS and CP dynamics lies in the influence on a node's 
state from its vicinity. In both SIS and CP dynamics, I, and di depend on the 
individuals' immune systems and are selected from a uniform distribution char- 
acterizing the natural diversity (see Supplementary Note 7 for details of numerical 
simulations). Moreover, a hidden source is regarded as infected for all time. 

Mathematical formulation of reconstruction based on CST. For SIS dynamics, 
suppose measurements at a sequence of times t=ti, 
equation (5) leads to the following matrix form ; 



= 0„ 



, tm are available. 

X (N- 1) • ^{N- 1) X 



ln[l- 
ln[l- 



\n[l -Pf{tm)\ 



Slih) 

Sl{t2) 



Si-i{h) 



Si-l(tm) Si+i{tm) 



SNitm)i 



\n{l-li)aii 

ln(l -l/)a,-/_i 
ln(l — li)ai^i+i 



ISiitm) 

. \n{l -li)aiN . 

where the vector X(;v i) x i contains all possible connections between node i and all 
other nodes, and it is sparse for a general complex network. We see that if the vector 
X 1 and the matrix x (n- i) can be constructed from time series, X(7^_ i) x i can 
then be solved by using CST. The main challenge here is that the infection prob- 
abilities P^^ (t) at different times are not given directly by the time series of the nodal 
states. To devise a heuristic method to estimate the probabilities, we assume that the 
neighbouring set Ti of the node i is known. The number of such neighbouring nodes 
is given by /c,, the degree of node i and their states at time t can be denoted as 

Sr,W = {SM,S2{t),---,S,Xt)}. (11) 

To approximate the infection probability, we use Si{t) = 0 so that at f + 1, the node i 
can be infected with certain probability. In contrast, if Si{t) = 1, Si{t-\- 1) is only 
related with the recovery probability 6,. Hence, we focus on the Si{t) = 0 case to 
derive Pf^(0- If can find two time instants: ti, t2ET {T is the length of time 
series), such that S,(fi) = 0 and 5,(^2) = 0, we can then calculate the normalized 
Hamming distance H[Sr, (fi), Sp, (^2)] between SrXh) and Sr,(f2)> where the nor- 
malized Hamming distance between two strings of equal length is defined as ratio of 
the number of positions with different symbols between them and the length of 
string. If H[Sr, (fi), Sr, (t2)] = 0, we can regard the states at the next time step, 
Siiti + 1) and 5/(^2 + 1), as i.i.d Bernoulli trials. In this case, using the law of large 
numbers, we have 

1 ' 

lim-VsKtv + l)^P?Hi), yt,„Si{t„)=0, H[Sr,{i.),Sr,M]=0. (12) 

1^00 / 

A more intuitive understanding of (equation (12)) is that if the states of is neigh- 
bours are unchanged, the fraction of times of i being infected by its neighbours over 
the entire time period will approach the actual infection probability P^^ . Note, 
however, that the neighbouring set of i is unknown and to be inferred. A strategy is 
then to artificially enlarge the neighbouring set SrXt) to include aU nodes in the 
network, except i. In particular, we denote 



s_i{t) = {Si(0,S2(0, ••• ,s,_i(0,s,-+i(0 

If H[S„,(fi), S_i{t2)]=0, the condition H[Sr,(fi), Sr,(f2)] 
Consequently, due 
numbers, we have 



(13) 

0 will be ensured. 

Consequently, due to the nature of i.i.d Bernoulli trials, from the law of large 



lim\y Si{U, + I) ^P^\k), V fv,S,-(fv) = 0, H[S-i{k),S-i{U,)]=0. 
1^00 I 

Hence, the infection probability P'^^ (t^) of a node at can be evaluated by averaging 
over its states associated with zero-normalized Hamming distance between the 
strings of other nodes at some time associated with tot. In practice, to find two strings 
with absolute zero -normalized Hamming distance is unlikely. We thus set a 
threshold A so as to pick the suitable strings to approximate the law of large 
numbers, that is 



yf^P»'(t,.), V <„S,(<v) = 0, H[S_,('»),S-,(t,.)]<A, (14) 



where S_f(fa) serves as a base for comparison with S^i{t) at all other times 
and\Y.i=i - P^\h). As H[S.i(iot),S.i{t,)] is not exactly zero, there is a 

small difference between Pf^(fa) and P|'^(tv) (v = 1, • • • , /). We thus consider the 
average of P^^iU,) for all ty to obtain PP(fa), leading to the right-hand side of 
equation (14). We denote (S,(t„ + 1)> = \J2i=i SiiU + 1) and (P?Hi)> = 
]J2l=i P-\t,). To reduce the error in the estimation, we implement the average on 



S^iit) over all selected strings through equation (14). The averaging process is with 
respect to the nodal states Sjj^^iit) on the right-hand side of the modified dynamical 
equation (5). Specifically, averaging over time t restricted by equation (14) on both 
sides of equation (5), we obtain {\n[l - P^\t)]) = ln(l -^^ T.f=i,j^i(iij{Sjit)). 
If is small with insignificant fluctuations, we can approximately have 
ln[l - (P?nO>] - (ln[l -PfUOl) (see Supplementary Fig. 10 and Supplementary 
Note 8), which leads to ln[l - {P^\t))] ln(l - Xt) X;;=i,;Vi«y(S;(0>- 
Substituting {P-\k)) by (S/(fa + 1)>, we finally get 



ln[l-(S.(t« + l)>] ^Hl-X,). ^2 ^rjiSjil))- 



(15) 



Although the above procedure yields an equation that bridges the links of an arbi- 
trary node i with the observable states of the nodes, a single equation does not 
contain sufficient structural information about the network. Our second step is then 
to derive a sufficient number of linearly independent equations required by CST to 
reconstruct the local connection structure. To achieve this, we choose a series of base 
strings at a number of time instants from a set denoted by Tbase' in which each pair of 
strings satisfy 



(16) 



where ia and tp correspond to the time instants of two base strings in the time series 
and 0 is a threshold. For each string, we repeat the process of establishing the 
relationship between the nodal states and connections, leading to a set of equations at 
different values oi ia in equation (15), as described in the matrix form (equation (6)). 
See Supplementary Fig. 11, 12 and Supplementary Note 8 for the dependence of 
success rate on threshold A and 0 for SIS and CP dynamics in combination with 
four types of networks. 

Inferring inhomogeneous infection rates. The values of the infection rate /Ij of 
nodes can be inferred after the neighbourhood of each node has been successfully 
reconstructed. The idea roots in the fact that the infection probability of a node 
approximated by the frequency of being infected calculated from time series is 
determined both by its infection rate and by the number of infected nodes in its 
neighbourhood. To provide an intuitive picture, we consider the following simple 
scenario in which the number of infected neighbours of node i does not change 
with time. In this case, the probability of i being infected at each time step is fixed. 
We can thus count the frequency of the 01 and 00 pairs embedded in the time 
series of i. The ratio of the number of 01 pairs over the total number of 01 and 00 
pairs gives approximately the infection probability. The infection rate can then be 
calculated by using equations (4) and (7) for the SIS and CP dynamics, respectively. 
In a real-world situation, however, the number of infected neighbours varies with 
time. The time-varying factor can be taken into account by sorting out the time 
instants corresponding to different numbers of the infected neighbours, and the 
infection probability can be obtained at the corresponding time instants, leading to 
a set of values for the infection rate whose average represents an accurate estimate 
of the true infection rate for each node. 

To be concrete, considering all the time instants associated with ki infected 
neighbors, we denote sf'^ = (1//) J2i=i Si{tv + 1), V U, J^^.^^, '^;(^v) = h and 
Sf(fv) = 0, where is the neighbouring set of node /, ki is the number of infected 
neighbours and sf^^"^ represents the average infected fraction of node i with k-, 
infected neighbours. Given sf^\ we can rewrite equation (4) by substituting s\ for 
P?HO and if^^ for Xi, which yields if'^ = 1 - exp[ln(l - sf'^)/ki]. To reduce the 
estimation error, we average /if with respect to different values of ki, as follows: 



if ^ (SIS) 



1 



(fci) 



(17) 



where A^ denotes the set of all possible infected neighbours during the epidemic 
process and Na, denotes the number of different values of ki in the set. Analogously, 
for CP, we can evaluate Xf^^ from equation (7) by 



(18) 



where ki = J2jLi ^ij Is the node degree of i. Insofar as all the links of i have been 
successftilly reconstructed, sf^^ can be obtained from the time series in terms of the 
satisfied Si{ty+ 1), allowing us to infer if^^ via equations (17) and (18). 

Note that the method is applicable to any type of networks insofar as the 
network structure has been successfully reconstructed. 

Networks analysed. Model networks and real networks we used are described in 
Supplementary Note 10 and Table 1. 
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